Automatic Termination Analysis for GPU Kernels∗
نویسندگان
چکیده
We describe a method for proving termination of massively parallel GPU kernels. An implementation in KITTeL is able to show termination of 94% of the 598 kernels in our benchmark suite.
منابع مشابه
Automatic Restructuring of GPU Kernels for Exploiting Inter-thread Data Locality
Hundreds of cores per chip and support for fine-grain multithreading have made GPUs a central player in today’s HPC world. For many applications, however, achieving a high fraction of peak on current GPUs, still requires significant programmer effort. A key consideration for optimizing GPU code is determining a suitable amount of work to be performed by each thread. Thread granularity not only ...
متن کاملGPU-UniCache: Automatic Code Generation of Spatial Blocking for Stencils on GPUs
Spatial blocking is a critical memory-access optimization to efficiently exploit the computing resources of parallel processors, such as many-core GPUs. By reusing cache-loaded data over multiple spatial iterations, spatial blocking can significantly lessen the pressure of accessing slow global memory. Stencil computations, for example, can exploit such data reuse via spatial blocking through t...
متن کاملInterleaving and Lock-Step Semantics for Analysis and Verification of GPU Kernels
We study semantics of GPU kernels — the parallel programs that run on Graphics Processing Units (GPUs). We provide a novel lock-step execution semantics for GPU kernels represented by arbitrary reducible control flow graphs and compare this semantics with a traditional interleaving semantics. We show for terminating kernels that either both semantics compute identical results or both behave err...
متن کاملEnabling Task Parallelism in the CUDA Scheduler
General purpose computing on graphics processing units (GPUs) introduces the challenge of scheduling independent tasks on devices designed for data parallel or SPMD applications. This paper proposes an issue queue that merges workloads that would underutilize GPU processing resources such that they can be run concurrently on an NVIDIA GPU. Using kernels from microbenchmarks and two applications...
متن کاملVectorized Higher Order Finite Difference Kernels
Several highly optimized implementations of Finite Difference schemes are discussed. The combination of vectorization and an interleaved data layout, spatial and temporal loop tiling algorithms, loop unrolling, and parameter tuning lead to efficient computational kernels in one to three spatial dimensions, truncation errors of order two to twelve, and isotropic and compact anisotropic stencils....
متن کامل